Data Partitioning and Feature Engineering

Author
Affiliation

Nakul R. Padalkar

Boston University

Published

September 29, 2023

Modified

November 12, 2025

1 Interaction terms in regression

Interaction terms enter the regression equation as the product of two constitutive terms, x_1 and x_2. For this product term x_1x_3, the regression equation adds a separate coefficient \beta_{3}.

y=\beta_{0}+\beta_{1}x_1+\beta_{2}x_2+\beta_{3}x_1x_2+\epsilon

2 Example 1: one binary, one continuous term

In thinking about interaction terms, it helps to first simplify by working through the prediction of the regression equation for different values of two predictors, x_1 and x_2. We can imagine a continuous outcome y, e.g. the income of Hollywood actors, that we predict with two variables. The first, x_1, is a binary variable such as female gender; it takes on values of 0 (males) and 1 (females) only. The second, x_2, is a continuous variable, ranging from −5 to +5, such as a (centered and standardized) measure of age. In this case, I’m using the term “effect” loosely and non-causally. An interaction term expresses the idea that the effect of one variable depends on the value of the other variable. With these variables, this suggests that effect of age on actors’ income is different for male actors than for female actors.

  1. \beta_{1} is the effect of x_1 on y when x_2 is 0:
    1. \hat{y}=\beta_{0}+\beta_{1}x_1+\beta_{2} \times 0+\beta_{3}x_1 \times 0
    2. \hat{y}=\beta_{0}+\beta_{1}x_1+0+0
  2. \beta_{2} is the effect of x_2 on y when x_1 is 0:
    1. \hat{y}=\beta_{0}+\beta_{1} \times 0+\beta_{2}x_2+\beta_{3} \times 0 \times x_2
    2. \hat{y}=\beta_{0}+\beta_{2}x_2+0
  3. When both x_1 and x_2 are not 0, \beta_{3} becomes important, and the effect of x_1 now varies with the value of x_2. We can plug in 1 for x_1 and simplify the equation as follows:
    1. \hat{y}=\beta_{0}+\beta_{1} \times 1+\beta_{2}x_2+\beta_{3} \times 1 \times x_2
    2. \hat{y}=\beta_{0}+\beta_{1}+\beta_{2}x_2+\beta_{3} \times x_2
    3. \hat{y}=(\beta_{0}+\beta_{1})+(\beta_{2}+\beta_{3}) \times x_2

With simulated data, this can be illustrated easily. I begin by simulating a dataset with 200 observations, two predictors x_1 (binary: male/female) and x_2 (continuous: standardized and centered age), and create y (continuous: income) as the linear combination of x_1, x_2, and an interaction term of the two predictors.

set.seed(123)
n.sample <- 200
x1 <- rbinom(n.sample, size = 1, prob = 0.5)
x2 <- runif(n.sample, -5, 5)
a <- 5
b1 <- 3
b2 <- 4
b3 <- -3
e <- rnorm(n.sample, 0, 5)
y <- a + b1 * x1 + b2 * x2 + b3 * x1 * x2 + e
sim.dat <- data.frame(y, x1, x2)

Before advancing, this is what the simulated data look like:

par(mfrow = c(1, 3))
hist(sim.dat$y, main="Histogram of Y", breaks=20, col="#336699")
hist(sim.dat$x1, main="Histogram of X1", breaks=20, col="#336699")
hist(sim.dat$x2, main="Histogram of X2", breaks=20, col="#336699")

par(mfrow = c(1, 1))

For convenience, you could also use the multi.hist() function from the “psych” package. It automatically adds a density curve (dashed line) and a normal density plot (dotted line).

library(psych)
multi.hist(sim.dat, nrow = 1, bcol="#336699",breaks=15)

Fitting a model using what we know about the data-generating process gives us:
mod.sim <- lm(y ~ x1 * x2, dat = sim.dat)
stargazer::stargazer(mod.sim, type='html', summary=TRUE,report = "vc*stp",ci=TRUE)
Dependent variable:
y
x1 3.872***
(2.484, 5.260)
t = 5.467
p = 0.00000
x2 3.955***
(3.630, 4.281)
t = 23.790
p = 0.000
x1:x2 -2.944***
(-3.418, -2.469)
t = -12.160
p = 0.000
Constant 4.789***
(3.824, 5.754)
t = 9.726
p = 0.000
Observations 200
R2 0.762
Adjusted R2 0.758
Residual Std. Error 4.997 (df = 196)
F Statistic 208.613*** (df = 3; 196)
Note: p<0.1; p<0.05; p<0.01

First, I plot the relationship between the continuous predictor x_2 and y for all observations where x_1=0. The regression line is defined by an intercept \alpha and a slope \beta_2, as we calculated above. I can extract these from the lm object above using the coef() function. I use the index in square brackets to extract the coefficient I need.

plot(x = sim.dat[sim.dat$x1 == 0, ]$x2, y = sim.dat[sim.dat$x1 == 0, ]$y, 
     col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25), pch = 19,
     xlab = "x2", ylab = "y")
abline(a = coef(mod.sim)[1], b = coef(mod.sim)[3], col = "blue", pch = 19, lwd = 2)

For all cases where x_{1}=0, what is the relationship between x2 and y? Next, we can add the data points where x_{1}=1, and add the separate regression line for these points:

plot(x = sim.dat[sim.dat$x1 == 0, ]$x2, y = sim.dat[sim.dat$x1 == 0, ]$y, 
     pch = 19, xlab = "x2", ylab = "y", col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25))
abline(a = coef(mod.sim)[1], b = coef(mod.sim)[3], col = "blue", lwd = 2)
points(x = sim.dat[sim.dat$x1 == 1, ]$x2, y = sim.dat[sim.dat$x1 == 1, ]$y, 
       col = rgb(red = 1, green = 0, blue = 0, alpha = 0.25), pch = 19)
abline(a = coef(mod.sim)[1] + coef(mod.sim)[2], b = coef(mod.sim)[3] + coef(mod.sim)[4], 
       col = "red", lwd = 2)

For all cases where x_{1}=1, what is the relationship between x2 and y?

2.1 Specifying an interaction

  1. Although the tutorial does not discuss this, in the prevalent case of using null hypothesis significance tests on interaction effects, interaction terms should be derived from theory.
  2. In R, you specify an interaction term by putting an asterisk between the two constitutive terms: x_1 \times x_2

See Brambor et al. (2006) for more details. What would it imply if your model only included \beta_{3}x_{1}x_{2}? Omitting a coefficient is equivalent to setting \beta_1=0 and \beta_2=0. See for yourself: R will include the main effects regardless of specifications if an interaction term is present.

mod.sim2 <- lm(y ~ x1:x2, data = sim.dat)
stargazer::stargazer(mod.sim2, type='html', summary=TRUE,report = "vc*stp",ci=TRUE)
Dependent variable:
y
x1:x2 0.959***
(0.270, 1.647)
t = 2.729
p = 0.007
Constant 6.656***
(5.268, 8.043)
t = 9.404
p = 0.000
Observations 200
R2 0.036
Adjusted R2 0.031
Residual Std. Error 9.995 (df = 198)
F Statistic 7.448*** (df = 1; 198)
Note: p<0.1; p<0.05; p<0.01
plot(x = sim.dat[sim.dat$x1 == 0, ]$x2, y = sim.dat[sim.dat$x1 == 0, ]$y, 
     pch = 19, col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25),
     xlab = "x2", ylab = "y")
points(x = sim.dat[sim.dat$x1 == 1, ]$x2, y = sim.dat[sim.dat$x1 == 1, ]$y, 
       col = rgb(red = 1, green = 0, blue = 0, alpha = 0.25), pch = 19)
abline(a = coef(mod.sim2)[1], b = coef(mod.sim2)[2], lwd = 2)

3 Example 2: Two continuous variables

The first example illustrates the general logic of interaction terms. It carries over directly into the case of an interaction term between two continuous variables. Let’s now simulate another (fake) dataset, with a continuous outcome y being the income of professional baseball players. We can assume that x_{1} is a continuous variable and distributed normally with a mean of 2 and a standard deviation of 5 (e.g., a standardized measure of injury risk of each player). As previously, x_{2} is a continuous variable (such as standardized & centered age), ranging from −5 to +5. We might be thinking of a conditional effect, where older athletes make more — but only if they are rated as low injury risk. Older athletes with higher injury risk might be making less money. With this configuration,

  1. \beta_{1} is the effect of x_{1} on y when x_{2} is 0:
    1. \hat{y}=\beta_{0}+\beta_{1}x_{1}+\beta_{2} \times 0+\beta_{3}x_{1} \times 0
    2. \hat{y}=\beta_{0}+\beta_{1}x_{1}+0+0
  2. \beta_{2} is the effect of x_{2} on y when x_{1} is 0:
    1. \hat{y}=\beta_{0}+\beta_{1} \times 0+\beta_{2}x_{2}+\beta_{3} \times 0 \times x_{2}
    2. \hat{y}=\beta_{0}+\beta_{2}x_{2}+0
  3. When both x_{1} and x_{2} are not 0, \beta_{3} becomes important, and the effect of x_{1} now varies with the value of x_{2}. We can again plug in 1 for x_{1} and simplify the equation as follows: 1.\hat{y}=\beta_{0}+\beta_{1} \times 1+\beta_{2}x_{2}+\beta_{3} \times 1 \times x_{2}
    1. \hat{y}=\beta_{0}+\beta_{1}+\beta_{2}x_{2}+\beta_{3} \times x_{2}
    2. \hat{y}=(\beta_{0}+\beta_{1})+(\beta_{2}+\beta_{3}) \times x_{2}
  4. But x_{1} takes on many other values than just 1. The more general equation for y is then (cf. Brambor et al. p. 11)
    1. y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{1}x_{2}
    2. In this case, we cannot define just “two” equations (for x_{1}=1 and x_{1}=0). Instead, the effect of x_{1} varies with x_{2}, and vice versa.
    3. To express this, we use the term “marginal effects” or “conditional effects”: the effect of x_{1} conditional on the value of x_{2}.
    4. This marginal effect for x_{1} is a slope, and this slope is the sum of the coefficient for x_{1} and the product of \beta_{3} and the value of x_{2}: \beta_{1}+\beta_{3} \times x_{2}.
    5. Similarly, the marginal effect for x_{2} is \beta_{2}+\beta_{3} \times x_{1}.

I use the same simulation setup above, but make x_{1} a continuous variable, normally distributed with a mean of 2 and a standard deviation of 5.

set.seed(123)
n.sample <- 200
x1 <- rnorm(n.sample, mean = 2, sd = 5)
x2 <- runif(n.sample, -5, 5)
a <- 5
b1 <- 3
b2 <- 4
b3 <- -3
e <- rnorm(n.sample, 0, 20)
y <- a + b1 * x1 + b2 * x2 + b3 * x1 * x2 + e
sim.dat2 <- data.frame(y, x1, x2)

Before advancing, this is what the simulated data look like:

multi.hist(sim.dat2, nrow = 1, breaks=15)

Fitting a model using what we know about the data-generating process gives us:

mod.sim3 <- lm(y ~ x1 * x2, dat = sim.dat2)
stargazer::stargazer(mod.sim3, type='html', summary=TRUE,report = "vc*stp",ci=TRUE)
Dependent variable:
y
x1 2.592***
(1.993, 3.192)
t = 8.475
p = 0.000
x2 4.103***
(3.059, 5.146)
t = 7.704
p = 0.000
x1:x2 -3.038***
(-3.238, -2.839)
t = -29.883
p = 0.000
Constant 6.495***
(3.441, 9.549)
t = 4.168
p = 0.00005
Observations 200
R2 0.835
Adjusted R2 0.832
Residual Std. Error 20.343 (df = 196)
F Statistic 330.285*** (df = 3; 196)
Note: p<0.1; p<0.05; p<0.01

We can now calculate the effect of x_{1} on y at different levels of x_{2}, say, across the range of x_{2} with increments of 1. In R, I create this sequence using the seq() function.

x2.sim <- seq(from = -5, to = 5, by = 1)
x2.sim
 [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
eff.x1 <- coef(mod.sim3)[2] + coef(mod.sim3)[4] * x2.sim

I can now list the effect of x1 at different levels of x_{2}:

eff.dat <- data.frame(x2.sim, eff.x1)
eff.dat %>% kbl() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
x2.sim eff.x1
-5 17.7837936
-4 14.7454971
-3 11.7072007
-2 8.6689042
-1 5.6306077
0 2.5923113
1 -0.4459852
2 -3.4842817
3 -6.5225781
4 -9.5608746
5 -12.5991711

The object eff.x1 is now the coefficient of x_{1} at the respective levels of x_{2}. x_{2} in this context is also called the “moderator” or “moderating variable” because it moderates the effect of x_{1}. You can see that x_{1} exhibits a positive effect on y when x_{2} is approximately smaller than 1, at which point the effect of x_{1} on y turns negative.

You can now visualize this, just as you did in the binary-continuous interaction above. Because x_{2}, the moderating variable, is now continuous, there is an (almost) infinite number of separate regression lines for x_{1} and y: each individual value of x_{2} creates a separate regression line (with a separate slope coefficient). The following scatterplot plots observations against their values on x_{1} and 7 and colors them based on their value of x_{2}, as in the previous example.

rbPal <- colorRampPalette(colors = c("red", "blue"))
sim.dat2$x2.col <- rbPal(10)[as.numeric(cut(sim.dat2$x2,breaks = 10))]
plot(x = sim.dat2$x1, y = sim.dat2$y, col = sim.dat2$x2.col,
     pch = 19, xlab = "x1", ylab = "y")

Then, I’m adding 10 regression lines for the 10 values of x_{2} that I just calculated. The lines are also colored according to the values of x_{2}.

eff.dat$x2.col <- rbPal(10)[as.numeric(cut(eff.dat$x2,breaks = 10))]
plot(x = sim.dat2$x1, y = sim.dat2$y, col = sim.dat2$x2.col,
     pch = 19, xlab = "x1", ylab = "y")
apply(eff.dat, 1, function(x) abline(a = coef(mod.sim3)[1], b = x[2], col = x[3]));

NULL

You can see that at low values of x_{2} (red), the relationship between x_{1} and y is positive (upward lines), whereas at higher values of x_{2} (blue), the relationship is negative (downward lines).

4 Why it is important to inspect your data first

As you’ve heard before, and as Hainmueller et al.’s working paper imply, you should always inspect your data. This also applies to interactions. Always investigate whether:

  1. Your data have observations across the whole range of each of the constitutive terms. It is best to do this by creating simple scatterplots of the form you see in Hainmueller et al.
  2. The raw relationship between one explanatory variable and the outcome variable is linear at different levels of the moderating variable. This is easy for binary moderators. For continuous moderators, Hainmueller et al. suggest splitting the observations in low, medium, and high levels of the continuous moderator.

You should see this as an opportunity to learn something about your data rather than a nuisance.

Here is an example of how I would inspect my data from the first model, which contained an interaction between a dichotomous (x_{1}) and continuous (x_{2}) variable:

par(mfrow = c(1, 2))
plot(x = sim.dat[sim.dat$x1 == 0, ]$x2,
     y = sim.dat[sim.dat$x1 == 0, ]$y,
     main = "x1 = 0", xlab = "x2", ylab = "y")
plot(x = sim.dat[sim.dat$x1 == 1, ]$x2,
     y = sim.dat[sim.dat$x1 == 1, ]$y,
     main = "x1 = 1", xlab = "x2", ylab = "y")

par(mfrow = c(1, 1))

I see no reason to be concerned about the distribution of x_{2} or the linearity of the relationship between x_{2} and y, and this makes sense, given how I simulated the data. But see Hainmueller et al. for examples where this does become a problem.

Next, here is an example of how I would inspect my data from the third model above, which contained an interaction between a continuous (x_{1}) and continuous (x_{2}) variable. First, I split the moderator (again, I choose x_{1}) in three groups. For this, I use the cut() function (see the R Cookbook for more info).

sim.dat2$x1_tri <- cut(x = sim.dat2$x1, 
                       breaks = quantile(x = sim.dat2$x1, probs = c(0, 0.33, 0.66, 1)),
                       include.lowest = TRUE)
table(sim.dat2$x1_tri) %>% kbl() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
Var1 Freq
[-9.55,-0.151] 66
(-0.151,3.66] 66
(3.66,18.2] 68
par(mfrow = c(1, 3))
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[1], ]$x2,
     y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[1], ]$y,
     main = paste("x1 = ", levels(sim.dat2$x1_tri)[1], sep = ""),
     xlab = "x2", ylab = "y")
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[2], ]$x2,
     y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[2], ]$y,
     main = paste("x1 = ", levels(sim.dat2$x1_tri)[2], sep = ""),
     xlab = "x2", ylab = "y")
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[3], ]$x2,
     y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[3], ]$y,
     main = paste("x1 = ", levels(sim.dat2$x1_tri)[3], sep = ""),
     xlab = "x2", ylab = "y")

par(mfrow = c(1, 1))

Again, given how I simulated these data, there is no reason to be concerned about the distribution of x_{2} or the linearity of the relationship between x_{2} and y.

5 Obtaining standard errors for marginal effects

As usual, you should evaluate the variance around coefficients. The online appendix to Brambor et al. (2006) gives you the formula for this variance:

For the variance of the marginal/conditional effect of x_{1}, use this equation:

\mathbb{V}ar=\mathbb{V}ar(\beta_{1})+x_{2}^{2} \times \mathbb{V}ar(\beta_{3})+2 \times x_{2} \times \mathbb{C}ov(\beta_{1},\beta_{3})

The standard error is then easily calculated as the square root of the variance.

Note that this equation makes use of the covariance of estimates. You learned about variance and covariance on Day 10, and you can access the variance-covariance matrix with the vcov() function. You can then extract cells of the matrix by using the familiar square brackets, where the first number stands for rows and the second for columns.

vcov(mod.sim3) %>% kbl %>% 
   kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
(Intercept) x1 x2 x1:x2
(Intercept) 2.4277431 -0.1831418 -0.0009989 -0.0018731
x1 -0.1831418 0.0935660 -0.0018849 0.0007609
x2 -0.0009989 -0.0018849 0.2835933 -0.0200801
x1:x2 -0.0018731 0.0007609 -0.0200801 0.0103373

Now, you can obtain the standard error of the respective coefficients by taking the square root of the variance. In this code, I’m nesting the equation above in the sqrt() function.

eff.dat$se.eff.x1 <- sqrt(vcov(mod.sim3)[2, 2] + 
                            eff.dat$x2.sim^2 * vcov(mod.sim3)[4, 4] + 
                            2 * eff.dat$x2.sim * vcov(mod.sim3)[2, 4])
eff.dat %>% kbl() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F,
        position = "center")
x2.sim eff.x1 x2.col se.eff.x1
-5 17.7837936 #FF0000 0.5868481
-4 14.7454971 #FF0000 0.5028682
-3 11.7072007 #E2001C 0.4266577
-2 8.6689042 #C60038 0.3631416
-1 5.6306077 #AA0055 0.3199713
0 2.5923113 #8D0071 0.3058857
1 -0.4459852 #71008D 0.3246924
2 -3.4842817 #5500AA 0.3714283
3 -6.5225781 #3800C6 0.4372270
4 -9.5608746 #1C00E2 0.5148307
5 -12.5991711 #0000FF 0.5996737

6 Plotting the marginal/conditional effect

It is usually (always, really) more useful to plot the marginal effect of a variable across the range of the second variable in the interaction, rather than plotting separate regression lines as I did above. I strongly recommend this approach. You already have the tools to do this:

plot(x = eff.dat$x2.sim, y = eff.dat$eff.x1, type = "l", xlab = "x2", 
     ylab = "Conditional effect of x1")
abline(h = 0, col = "grey", lty = "dashed")

The solid line shows the conditional effect of x_{1} across the range of x_{2}. You can add 95\% confidence intervals by plotting lines representing the conditional effect \pm 1.96 \times SE.

plot(x = eff.dat$x2.sim, y = eff.dat$eff.x1, type = "l", xlab = "x2", 
     ylab = "Conditional effect of x1")
abline(h = 0, col = "grey", lty = "dashed")
lines(x = eff.dat$x2.sim, y = eff.dat$eff.x1 + 1.96 * eff.dat$se.eff.x1, lty = "dashed", col="#DD1E2F")
lines(x = eff.dat$x2.sim, y = eff.dat$eff.x1 - 1.96 * eff.dat$se.eff.x1, lty = "dashed", col="#DD1E2F")
abline(h = 0, lty = 2, col = "grey")

7 Example 3: “Electoral Institutions, Unemployment and Extreme Right Parties: A Correction.”

The final example uses real-world data on the electoral success of extreme right-wing parties in 16 countries over 102 elections. These data were used in Matt Golder’s BJPS article “Electoral Institutions, Unemployment and Extreme Right Parties: A Correction.” (see Brambor et al. 2006 for the full citation).

ei.dat <- rio::import("../datasets/golder.2003.csv")
summary(ei.dat) %>% kbl() %>% 
    kable_styling(bootstrap_options = "striped", full_width = F,
        position = "center")
country year unemp enp lerps thresh
Length:102 Min. :1970 Min. : 0.300 Min. :1.72 Min. :0.0000 Min. : 0.670
Class :character 1st Qu.:1975 1st Qu.: 2.125 1st Qu.:2.79 1st Qu.:0.0000 1st Qu.: 2.600
Mode :character Median :1980 Median : 5.300 Median :3.40 Median :0.4700 Median : 5.000
NA Mean :1980 Mean : 5.811 Mean :3.59 Mean :0.9208 Mean : 8.807
NA 3rd Qu.:1985 3rd Qu.: 8.375 3rd Qu.:4.63 3rd Qu.:1.8284 3rd Qu.: 9.875
NA Max. :1990 Max. :20.800 Max. :5.10 Max. :3.3142 Max. :35.000
par(mfrow = c(2, 2))
hist(ei.dat$lerps, main="Histogram of Log voteshare ERP", breaks=20, col="#336699")
hist(ei.dat$thresh, main="Histogram of Effective Electoral Threshold", breaks=20, col="#336699")
hist(ei.dat$enp, main="Histogram of Effective Number of parties", breaks=20, col="#336699")
hist(ei.dat$unemp, main="Histogram of Unemployment", breaks=20, col="#336699")

par(mfrow = c(1, 1))
Variable Description
lerps Log of extreme right percentage support + 1
thresh Effective threshold of representation and exclusion in the political system
enp Effective number of parties
unemp Unemployment rate
country Country name
year Election year

This study aims to determine the relationship between electoral support for extreme right-wing parties (the outcome variable), the threshold for representation in the political system (the first predictor), and the effective number of parties (the second predictor). The author hypothesized that the effect of representation thresholds might vary depending on the effective number of parties. The unemployment rate at the time of the election is a control variable. Because elections in the same country might be subject to idiosyncratic dynamics, the author includes a dummy variable for each country (cf. our discussion on Days 9 and 10).

After fitting the model, I display the results using the familiar screenreg() function from the “texreg” package. I make use of the omit.coef option to avoid printing of the coefficients for each country dummy variable.

m3 <- lm(lerps ~ thresh + enp + thresh * enp + unemp + factor(country), data = ei.dat)
stargazer::stargazer(m3, type='html', summary=TRUE,report = "vc*stp",ci=TRUE)
Dependent variable:
lerps
thresh 0.267***
(0.143, 0.391)
t = 4.216
p = 0.0001
enp 1.159***
(0.311, 2.008)
t = 2.677
p = 0.009
unemp 0.070***
(0.040, 0.100)
t = 4.541
p = 0.00002
factor(country)belg -3.513***
(-5.354, -1.672)
t = -3.740
p = 0.0004
factor(country)denm -2.844***
(-4.904, -0.785)
t = -2.706
p = 0.009
factor(country)finl -3.889***
(-6.039, -1.740)
t = -3.546
p = 0.001
factor(country)fran -0.354
(-2.124, 1.415)
t = -0.393
p = 0.696
factor(country)germ -2.280***
(-3.048, -1.511)
t = -5.816
p = 0.00000
factor(country)gree -2.365***
(-2.924, -1.806)
t = -8.296
p = 0.000
factor(country)irel -2.886***
(-3.819, -1.953)
t = -6.062
p = 0.00000
factor(country)ital -1.600***
(-2.685, -0.515)
t = -2.891
p = 0.005
factor(country)neth -4.452***
(-6.268, -2.635)
t = -4.803
p = 0.00001
factor(country)norw -0.948*
(-1.997, 0.101)
t = -1.771
p = 0.081
factor(country)port -2.432***
(-3.167, -1.697)
t = -6.484
p = 0.000
factor(country)spai -0.271
(-1.035, 0.492)
t = -0.696
p = 0.489
factor(country)swed -2.827***
(-3.752, -1.902)
t = -5.993
p = 0.00000
factor(country)swit -1.670
(-3.914, 0.573)
t = -1.459
p = 0.149
factor(country)uk -3.796***
(-5.296, -2.296)
t = -4.960
p = 0.00001
thresh:enp -0.099***
(-0.140, -0.057)
t = -4.667
p = 0.00002
Constant -0.967
(-3.120, 1.187)
t = -0.880
p = 0.382
Observations 102
R2 0.853
Adjusted R2 0.819
Residual Std. Error 0.428 (df = 82)
F Statistic 25.009*** (df = 19; 82)
Note: p<0.1; p<0.05; p<0.01

Rather than trying to interpret these results from the regression table, I calculate the marginal effect of electoral thresholds across the range of the effective number of parties. This step is exactly analogous to what I did with simulated data above.

thresh_sim <- seq(from = min(ei.dat$thresh), to = max(ei.dat$thresh), length.out = 50)
enp_sim <- seq(from = min(ei.dat$enp), to = max(ei.dat$enp), length.out = 50)
eff_thresh <- coef(m3)[2] + coef(m3)[20] * enp_sim
eff_thresh_se <- sqrt(vcov(m3)[2, 2] + enp_sim^2 * vcov(m3)[20, 20] + 2 * enp_sim * vcov(m3)[2, 20])
plot(x = enp_sim, y = eff_thresh, type = "l", 
     xlab = "Effective number of parties",
     ylab = "Conditional effect of the electoral threshold")
lines(x = enp_sim, y = eff_thresh + 1.96 * eff_thresh_se, lty = "dashed", col="#DD1E2F")
lines(x = enp_sim, y = eff_thresh - 1.96 * eff_thresh_se, lty = "dashed", col="#DD1E2F")
abline(h = 0, lty = "dashed", col="#336699")

What do you conclude from this figure? What do these results say about the author’s hypothesis?

8 Canned R functions to plot marginal effects in OLS

Rather than calculating marginal effects by hand, you might want to use R functions to plot marginal effects with less effort.

8.1 DAintfun2()

One good option is Dave Armstrong’s DAintfun2() function, which is part of his “DAMisc” package. You need to first install the package, then load it, and then specify the function with at least the following arguments:

# install.packages("DAMisc")
library(DAMisc)
DAintfun2(m3, varnames = c("thresh", "enp"), 
                  rug = TRUE, hist = TRUE)

DAintfun2() by default plots both the marginal effect of the first variable as well as that of the second variable. It also allows you to include an indicator of the actual distribution of the variables, via a histogram or a rug plot. With these supplementary plots, you can immediately see whether the estimated effect of a variable corresponds to real observations at that value of the moderating variable.

set.seed(123)
n.sample <- 200
x1 <- rbinom(n.sample, size = 1, prob = 0.5)
x2 <- runif(n.sample, -5, 5)
a <- 5
b1 <- 3
b2 <- 4
b3 <- -3
e <- rnorm(n.sample, 0, 5)
y <- a + b1 * x1 + b2 * x2 + e
sim.dat <- data.frame(y, x1, x2)
head(sim.dat)
          y x1         x2
1  5.543093  0 -2.6127397
2 33.056422  1  4.6235894
3  7.728904  0  1.0136573
4 11.317159  1  0.1502973
5  2.031234  1 -0.9742666
6 17.828627  0  3.8024654